Theoretical Analysis of the k-Means Algorithm - A Survey

Authors

  • Johannes Blömer
  • Christiane Lammersen
  • Melanie Schmidt
  • Christian Sohler
Abstract

Clustering is a basic process in data analysis. It aims to partition a set of objects into groups called clusters such that, ideally, objects in the same group are similar and objects in different groups are dissimilar to each other. There are many scenarios where such a partition is useful. It may, for example, be used to structure the data to allow efficient information retrieval, to reduce the data by replacing a cluster with one or more representatives, or to extract the main ‘themes’ in the data. There are many surveys on clustering algorithms, including well-known classics [45, 48] and more recent ones [24, 47]. Note that [47] is titled ‘Data clustering: 50 years beyond K-means’, a reference to the k-means algorithm, probably the most widely used clustering algorithm of all time. It was proposed in 1957 by Lloyd [58] (and independently in 1956 by Steinhaus [71]) and is the topic of this survey. The k-means algorithm solves the problem of clustering to minimize the sum of squared errors (SSE). In this problem, we are given a set of points P ⊂ ℝ^d in a Euclidean space, and the goal is to find a set C ⊂ ℝ^d of k points (not necessarily included in P) such that the sum of the squared distances of the points in P to their nearest center in C is minimized. Thus, the objective function to be minimized is Σ_{p ∈ P} min_{c ∈ C} ‖p − c‖², where ‖·‖ denotes the Euclidean norm.
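As an illustration, the sum-of-squared-errors objective described above, together with the alternating assign-and-update iteration usually associated with the k-means algorithm, can be sketched in Python. This is a minimal sketch and not code from the survey: the function names (squared_distance, sse, lloyd_step, k_means), the decision to leave the center of an empty cluster where it is, and the stopping rule are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def lloyd_step(points: Sequence[Point], centers: Sequence[Point]) -> List[Point]:
    """One iteration: assign each point to its nearest center, then move
    every center to the mean (centroid) of the points assigned to it."""
    clusters: List[List[Point]] = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda j: squared_distance(p, centers[j]))
        clusters[nearest].append(p)

    new_centers: List[Point] = []
    for old, members in zip(centers, clusters):
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old center.
            new_centers.append(tuple(old))
            continue
        dim = len(members[0])
        new_centers.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        )
    return new_centers


def k_means(points: Sequence[Point], centers: Sequence[Point],
            max_iter: int = 100) -> List[Point]:
    """Repeat lloyd_step until the centers stop changing or max_iter is hit."""
    current = [tuple(c) for c in centers]
    for _ in range(max_iter):
        updated = lloyd_step(points, current)
        if updated == current:
            break
        current = updated
    return current


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    start = [(0.0, 0.0), (10.0, 0.0)]
    final = k_means(pts, start)
    print(sse(pts, start), sse(pts, final), final)
```

For the four example points and the two starting centers shown, a single step moves the centers to (0, 1) and (10, 1) and lowers the objective from 8 to 4. In general each step can only keep the objective the same or decrease it, which gives no guarantee that the final centers are globally optimal.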

Similar resources

Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar to each other, in some sense or another, than to objects in other groups. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the K-Means algorithm, namely Persistent K...

A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)

Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using metaheuristic algorithms. In this study, we propose a simulated-annealing-based algorithm for K-means in the clustering analysis, which we refer to as a...

Identifying People's Blood Donation Behavior Patterns Using the K-Means Algorithm Based on Recency, Frequency, and Blood Value

Introduction: The blood donation rate in developed countries is 18 times higher than in developing countries. It is estimated that if only five percent of Iran's population donated blood, it would be adequate to meet the needs of the community. The aim of this paper is to identify blood donors' loyalty behavior for proper planning to extend and enhance blood donation habits among t...

A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Because clustering algorithms are widely used in many fields, a lot of research is still going on to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of the cluster centers and hence can become trapped in a local optimum. In this paper...

Combination of Transformed-means Clustering and Neural Networks for Short-Term Solar Radiation Forecasting

In order to provide an efficient conversion and utilization of solar power, solar radiation data should be measured continuously and accurately over the long-term period. However, the measurement of solar radiation is not available to all countries in the world due to some technical and fiscal limitations. Hence, several studies were proposed in the literature to find mathematical and physical mod...

A hybrid DEA-based K-means and invasive weed optimization for facility location problem

In this paper, instead of the classical approach to the multi-criteria location selection problem, a new approach was presented based on selecting a portfolio of locations. First, the indices affecting the selection of maintenance stations were collected. The K-means model was used for clustering the maintenance stations. The optimal number of clusters was calculated through the Silhou...

Publication date: 2016